Abstract: Assumptions on a likelihood function, including a local Glivenko-Cantelli condition, imply the existence of M-estimators converging to an M-functional. Scatter matrix-valued estimators, defined on all empirical measures on $\mathbb{R}^d$ for $d \ge 2$, and equivariant under all, including singular, affine transformations, are shown to be constants times the sample covariance matrix. So, if weakly continuous, they must be identically 0. Results are stated on existence and differentiability of location and scatter functionals, defined on a weakly dense, weakly open set of laws, via elliptically symmetric t distributions on $\mathbb{R}^d$, following up on work of Kent, Tyler, and Dümbgen.



1. Introduction 

In this paper a law will be a Borel probability measure on $\mathbb{R}^d$. Let $\mathcal{N}_d$ be the set of all $d \times d$ nonnegative definite symmetric matrices and $\mathcal{P}_d \subset \mathcal{N}_d$ the subset of strictly positive definite symmetric matrices. For $(\mu, \Sigma) \in \Theta = \mathbb{R}^d \times \mathcal{N}_d$, $\mu$ will be viewed as a location parameter and $\Sigma$ as a scatter parameter, extending the notions of mean vector and covariance matrix to arbitrarily heavy-tailed distributions. For $d \ge 2$, $\Theta$ may be taken to be $\mathcal{P}_d$ or $\mathbb{R}^d \times \mathcal{P}_d$.

For a law $P$ on $\mathbb{R}^d$, let $X_1, X_2, \ldots$ be i.i.d. $(P)$ and let $P_n := n^{-1}\sum_{j=1}^n \delta_{X_j}$ be the empirical measure, where $\delta_x(A) := 1_A(x)$ for any point $x$ and set $A$. A class $\mathcal{F} \subset \mathcal{L}^1(\mathbb{R}^d, P)$ is called a Glivenko-Cantelli class for $P$ if

(1) $\sup\{|\int f\,d(P_n - P)| : f \in \mathcal{F}\} \to 0$

almost surely as $n \to \infty$ (if the supremum is measurable, as it will be in all cases considered in this paper). Talagrand [20, 21] characterized such classes. A class $\mathcal{F}$ of Borel measurable functions on $\mathbb{R}^d$ is called a universal Glivenko-Cantelli class if it is a Glivenko-Cantelli class for all laws $P$ on $\mathbb{R}^d$, and a uniform Glivenko-Cantelli class if the convergence in (1) is uniform over all laws $P$. Rather general sufficient conditions for the universal Glivenko-Cantelli property and a characterization up to measurability of the uniform property have been given [?].
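As an illustration of (1), here is a minimal sketch for the classical case $d = 1$ with $\mathcal{F}$ the indicators of half-lines $(-\infty, t]$, which is not one of the classes treated in this paper. The Uniform(0,1) law, the seed, and the names `sup_deviation` and `deviations` are assumptions made only for this example.

```python
import random

def sup_deviation(sample):
    """sup over t of |F_n(t) - t| for Uniform(0,1) data, i.e. the supremum
    in (1) over the class of indicators of half-lines (-inf, t]."""
    xs = sorted(sample)
    n = len(xs)
    # The supremum is attained at a jump of the empirical distribution function.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

random.seed(0)  # fixed seed so the run is reproducible
deviations = {n: sup_deviation([random.random() for _ in range(n)])
              for n in (100, 10_000)}
```

The deviation for the larger sample should be of order $n^{-1/2}$, in line with the almost sure convergence to 0 in (1).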

Let $\rho : (x, \theta) \mapsto \rho(x, \theta) \in \mathbb{R}$ be defined for $x \in \mathbb{R}^d$ and $\theta \in \Theta$, Borel measurable in $x$ and lower semicontinuous in $\theta$, i.e. $\rho(x, \theta) \le \liminf_{\phi \to \theta} \rho(x, \phi)$ for all $x$ and $\theta$. For a law $Q$, let $Q\rho(\phi) := \int \rho(x, \phi)\,dQ(x)$ if the integral is defined (not $\infty - \infty$), as it always will be if $Q = P_n$. An M-estimate of $\theta$ for a given $n$ and $P_n$ will be a $\hat\theta_n$ such that $P_n\rho(\theta)$ is minimized at $\theta = \hat\theta_n$, if it exists and is unique. A measurable function, not necessarily defined a.s., whose values are M-estimates is called an M-estimator. An M-limit $\theta_0 = \theta_0(P) = \theta_0(P, \rho)$ (with respect to $\rho$) will mean a point
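of $\Theta$ such that for every open neighborhood $U$ of $\theta_0$, as $n \to \infty$,

(2) $\Pr\{\inf\{P_n\rho(\theta) : \theta \notin U\} \le \inf\{P_n\rho(\phi) : \phi \in U\}\} \to 0$,
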
where the given probabilities are assumed to be defined. Then if M-estimators exist (with probability $\to 1$ as $n \to \infty$), they must converge in probability to $\theta_0(P)$. An M-limit $\theta_0 = \theta_0(P)$ with respect to $\rho$ will be called definite iff for every neighborhood $U$ of $\theta_0$ there is an $\varepsilon > 0$ such that the outer-probability convergence in display (3) holds as $n \to \infty$.

For a law $P$ on $\mathbb{R}^d$ and a given $\rho(\cdot, \cdot)$, a $\theta_1 = \theta_1(P)$ is called the M-functional of $P$ for $\rho$ if and only if there exists a measurable function $a(x)$, called an adjustment function, such that for $h(x, \theta) = \rho(x, \theta) - a(x)$, $Ph(\theta)$ is defined and satisfies $-\infty < Ph(\theta) \le +\infty$ for all $\theta \in \Theta$, and is minimized uniquely at $\theta = \theta_1(P)$, e.g. Huber [13]. As Huber showed, $\theta_1(P)$ doesn't depend on the choice of $a(\cdot)$. Clearly, an M-estimate $\hat\theta_n$ is the M-functional $\theta_1(P_n)$ if either exists.

A lower semicontinuous function $f$ from $\Theta$ into $(-\infty, +\infty]$ will be called uniminimal iff it has a unique relative minimum at a point $\theta_0$ and for all $t \in \mathbb{R}$, $\{\theta \in \Theta : f(\theta) < t\}$ is connected. For a differentiable function $f$, recall that a critical point of $f$ is a point where the gradient of $f$ is 0.

Examples. On $\Theta = \mathbb{R}$ let $f(x) = -(1 - x^2)^2$. Then $f$ has a unique relative minimum at $x = 0$, but no absolute minimum. It has two other critical points which are relative maxima. For $t < 0$ the set where $f < t$ is not connected.
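The first example can be checked numerically. The sketch below is only an illustration: the grid, the interval $[-3, 3]$, and the helper names are choices made here, and a grid count is a sanity check rather than a proof.

```python
def f(x):
    return -(1 - x * x) ** 2

def df(x):
    # f'(x) = 4x(1 - x^2), so the critical points are -1, 0, and 1.
    return 4 * x * (1 - x * x)

def d2f(x):
    # f''(x) = 4 - 12x^2
    return 4 - 12 * x * x

critical = [-1.0, 0.0, 1.0]
kinds = {x: ("min" if d2f(x) > 0 else "max") for x in critical}

def sublevel_components(t, lo=-3.0, hi=3.0, steps=6001):
    """Count maximal runs of grid points in [lo, hi] on which f < t."""
    count, inside = 0, False
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        below = f(x) < t
        if below and not inside:
            count += 1
        inside = below
    return count
```

For $-1 < t < 0$ the set $\{f < t\}$ consists of an interval around 0 and two outer rays; for $t \le -1$ only the two rays remain. Either way it is disconnected, so $f$ is not uniminimal.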

If $f$ is a strictly convex function on $\mathbb{R}^d$ attaining its minimum, then $f$ is uniminimal, as is $\theta \mapsto f(x - \theta)$ for any $x$. So is $\theta \mapsto \int f(x - \theta) - f(x)\,dP(x)$ if it's defined and finite and attains its minimum for a law $P$, as will be true e.g. if $f(x) = |x|^2$ and $\int |x|\,dP(x) < \infty$, or for all $P$ if $f$ is also Lipschitz, e.g. $f(x) = \sqrt{1 + |x|^2}$.

I have not found the notion here called "uniminimal" in the literature. Similar 
but more complex assumptions occur in some work on sufficient conditions for 
minimaxity in game theory, e.g. [?]. Thus, I claim no originality for the following
easily proved fact. 

Proposition 1. Let $(\Theta, d)$ be a locally compact metric space. If $f$ is uniminimal on $(\Theta, d)$, then (a) $f$ attains its absolute minimum at its unique relative minimum $\theta_0$, and (b) for every neighborhood $U$ of $\theta_0$ there is an $\varepsilon > 0$ such that $f(\theta) \ge f(\theta_0) + \varepsilon$ for all $\theta \notin U$.

Proof. Clearly (b) implies (a). To prove (b), suppose that for some or equivalently all small enough $\delta > 0$ and all $n = 1, 2, \ldots$, there are $\theta_n \in \Theta$ with $d(\theta_n, \theta_0) > \delta$ and $f(\theta_n) < f(\theta_0) + 1/n$. By connectedness, we can take $d(\theta_n, \theta_0) = \delta$ for all $n$. Then for $\delta > 0$ small enough, $\{\theta : d(\theta, \theta_0) \le \delta\}$ is compact and there is a converging subsequence $\theta_{n(k)} \to \theta_\delta$ with $d(\theta_\delta, \theta_0) = \delta$ and $f(\theta_\delta) \le f(\theta_0)$ by lower semicontinuity. Letting $\delta \downarrow 0$ we get a contradiction to the fact that $\theta_0$ is a unique relative minimum. □

For use in the next theorem, the display in the definition of a definite M-limit is

(3) $(P^n)^*\{\inf\{P_n\rho(\theta) : \theta \notin U\} \le \varepsilon + \inf\{P_n\rho(\phi) : \phi \in U\}\} \to 0$.

Theorem 2. Let $(\Theta, d)$ be a connected locally compact metric space and $(X, \mathcal{B}, P)$ a probability space. Let $h : X \times \Theta \to \mathbb{R}$ where for each $\theta \in \Theta$, $h(\cdot, \theta)$ is measurable. Assume that:

(i) $\theta \mapsto Ph(\theta) \in (-\infty, +\infty]$ is well-defined and uniminimal on $\Theta$, with minimum at $\theta_0$;

(ii) Outside an event $A_n$ whose probability converges to 0 as $n \to \infty$, $P_nh(\cdot)$ is uniminimal on $\Theta$;

(iii) For some neighborhood $U$ of $\theta_0$, $\{h(\cdot, \theta) : \theta \in U\}$ is a Glivenko-Cantelli class for $P$.

Then $\theta_0$ is the definite M-limit for $P$ and the M-functional $\theta_1(P)$.

Remark. Glivenko-Cantelli conditions on log likelihoods (and their partial derivatives through order 2) for parameters in bounded neighborhoods have been assumed in other work, e.g. [17] and [?].


Proof. That $\theta_0$ is an M-functional for $P$ follows from (i) and Proposition 1. By (iii), take $\delta > 0$ small enough so that $\{h(\cdot, \theta) : d(\theta, \theta_0) < \delta\}$ is a Glivenko-Cantelli class for $P$. By (i) and Proposition 1, take $\varepsilon > 0$ such that $Ph(\theta) > Ph(\theta_0) + 3\varepsilon$ whenever $d(\theta, \theta_0) > \delta/2$. Outside some events $A_n$ whose probability converges to 0 as $n \to \infty$, we have $P_nh(\theta_0) < Ph(\theta_0) + \varepsilon$ and $P_nh(\theta) > Ph(\theta_0) + 2\varepsilon$ for all $\theta$ with $\delta/2 < d(\theta, \theta_0) < \delta$. Then by (ii), also with probability converging to 1, $P_nh(\theta) > P_nh(\theta_0) + \varepsilon$ for all $\theta$ with $d(\theta, \theta_0) > \delta/2$, proving (3) and the theorem. □

A class $\mathcal{C}$ of subsets of a set $X$ is called a VC (Vapnik-Chervonenkis) class if for some $k < \infty$, for every subset $A$ of $X$ with $k$ elements, there is some $B \subset A$ with $B \ne C \cap A$ for all $C \in \mathcal{C}$, e.g. [?, Chapter 4]. A class $\mathcal{F}$ of real-valued functions on $X$ is called a VC major class iff $\{\{x \in X : f(x) > t\} : f \in \mathcal{F},\ t \in \mathbb{R}\}$ is a VC class of sets (e.g. [?, Section 4.7]). In the following, local compactness is stronger than needed but holds for the parameter spaces being considered.

Theorem 3. Let $h(x, \theta)$ be continuous in $\theta \in \Theta$ for each $x$ and measurable in $x$ for each $\theta$, where $\Theta$ is a locally compact separable metric space. Let $h(\cdot, \cdot)$ be uniformly bounded and let $\mathcal{F} := \{h(\cdot, \theta) : \theta \in \Theta\}$ be a VC major class of functions. Then $\mathcal{F}$ is a uniform, thus universal, Glivenko-Cantelli class.

Proof. Theorem 6 of [?] applies: sufficient bounds for the Koltchinskii-Pollard entropy of uniformly bounded VC major classes of functions are given in [3, Theorem 2.1(a), Corollary 5.8], and sufficient measurability of the class $\mathcal{F}$ follows from the continuity in $\theta$ and the assumptions on $\Theta$. □

For the t location-scatter functionals in Sections 4 and 5, the notions of VC major class, and local Glivenko-Cantelli class as in Theorem 2(iii), will be applicable. But as shown by Kent, Tyler and Vardi [16], to be recalled after Theorem 12, some parts of the development work only for t functionals, rather than for functions $\rho$ satisfying general properties such as convexity.



2. Equivariance for location and scatter 

Notions of "location" and "scale" or multidimensional "scatter" functional will be 
defined along with equivariance, as follows. 

Definitions. Let $Q \mapsto \mu(Q) \in \mathbb{R}^d$, resp. $\Sigma(Q) \in \mathcal{N}_d$, be a functional defined on a set $\mathcal{D}$ of laws $Q$ on $\mathbb{R}^d$. Then $\mu$ (resp. $\Sigma$) is called an affinely equivariant location (resp. scatter) functional iff for any nonsingular $d \times d$ matrix $A$ and $v \in \mathbb{R}^d$, with $f(x) := Ax + v$, and any law $Q \in \mathcal{D}$, the image measure $P := Q \circ f^{-1} \in \mathcal{D}$ also, with $\mu(P) = A\mu(Q) + v$ or, respectively, $\Sigma(P) = A\Sigma(Q)A'$. For $d = 1$, $\sigma(\cdot)$ with $0 \le \sigma < \infty$ will be called an affinely equivariant scale functional iff $\sigma^2$ satisfies the definition of affinely equivariant scatter functional.

Well-known examples of affinely equivariant location and scale functionals (for $d = 1$), defined for all laws, are the median and MAD (median absolute deviation), where for a real random variable $X$ with median $m$, the MAD of $X$ or its distribution is defined as the median of $|X - m|$.
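A small numerical check of these two equivariance properties for an empirical measure, as a sketch only: the data, the map $x \mapsto ax + v$, and the helper name `mad` are illustrative choices, and the sample size is odd so that the sample median is a single data point.

```python
from statistics import median

def mad(xs):
    """Median absolute deviation: the median of |x - m|, m the sample median."""
    xs = list(xs)
    m = median(xs)
    return median(abs(x - m) for x in xs)

data = [2.0, -1.0, 7.5, 3.0, 0.5, 100.0, 4.0]   # includes one gross outlier
a, v = -3.0, 10.0                                # nonsingular affine map x -> a*x + v
image = [a * x + v for x in data]

location_gap = median(image) - (a * median(data) + v)   # mu(P) - (a mu(Q) + v)
scatter_gap = mad(image) ** 2 - a * a * mad(data) ** 2  # sigma^2(P) - a sigma^2(Q) a
```

Both gaps vanish up to rounding, which is the $d = 1$ case of $\mu(P) = A\mu(Q) + v$ and $\Sigma(P) = A\Sigma(Q)A'$ with $\sigma^2 = \mathrm{MAD}^2$.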

Call a location functional $\mu(\cdot)$ or a scatter functional $\Sigma(\cdot)$ singularly affine equivariant if in the definition of affine equivariance $A$ can be any matrix, possibly singular. If a functional is defined on all laws, affinely equivariant, and weakly continuous, then it must be singularly affine equivariant. The classical sample mean and covariance are defined for all $P_n$ and singularly affine equivariant. It turns out that in dimension $d \ge 2$, there are essentially no other singularly affine equivariant location or scatter functionals defined for all $P_n$, and so weak continuity at all laws is not possible. First the known fact for location will be recalled, then an at least partially known fact for scatter will be stated and proved.

Let $X$ be a $d \times n$ data matrix whose $j$th column is $X_j \in \mathbb{R}^d$. Let $X^i$ be the $i$th row of $X$. Let $1_n$ be the $n \times 1$ vector with all components 1. Let $\bar{X} = \int x\,dP_n$ be the sample mean vector in $\mathbb{R}^d$, so that $X - \bar{X}1_n'$ is the centered data matrix. Note that $P_n$, and thus $\bar{X}$ and $\Sigma(X)$, are preserved by any permutation of the columns of $X$.
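The claim that the sample mean and covariance are singularly affine equivariant can be checked directly on a small data matrix. This is a sketch under stated choices: the matrices `X`, `B`, the shift `v`, the divisor $n$ in the covariance, and all helper names are assumptions of the example.

```python
def mean_vec(X):
    """Sample mean of a d x n data matrix given as a list of rows."""
    return [sum(row) / len(row) for row in X]

def cov(X):
    """Sample covariance with divisor n: (1/n)(X - mean 1')(X - mean 1')'."""
    n, m = len(X[0]), mean_vec(X)
    return [[sum((X[i][k] - m[i]) * (X[j][k] - m[j]) for k in range(n)) / n
             for j in range(len(X))] for i in range(len(X))]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

X = [[1.0, 2.0, 4.0, 7.0],
     [0.0, 3.0, -1.0, 2.0]]      # d = 2, n = 4
B = [[1.0, 2.0],
     [2.0, 4.0]]                 # singular: det B = 0
v = [5.0, -1.0]

# Image data y_j = B x_j + v, column by column.
Y = [[y + v[i] for y in row] for i, row in enumerate(matmul(B, X))]

m = mean_vec(X)
lhs_mean = mean_vec(Y)
rhs_mean = [sum(B[i][j] * m[j] for j in range(2)) + v[i] for i in range(2)]
lhs_cov = cov(Y)
rhs_cov = matmul(matmul(B, cov(X)), transpose(B))
```

Here `lhs_mean` agrees with `rhs_mean` and `lhs_cov` with `rhs_cov` up to rounding, even though `B` has rank 1.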

Theorem 4. (a) If $\mu(\cdot)$ is a singularly affine equivariant location functional defined for all $P_n$ on $\mathbb{R}^d$ for $d \ge 2$ and a fixed $n$, then $\mu(P_n) \equiv \bar{X}$.

(b) If in addition $\mu(\cdot)$ is defined for all $n$ and all $P_n$ on $\mathbb{R}^d$, then as $n$ varies, $\mu(\cdot)$ is not weakly continuous. Thus, there is no affinely equivariant, weakly continuous location functional defined on all laws on $\mathbb{R}^d$ for $d \ge 2$.

Proof. Part (a) follows from work of Obenchain [18, Lemma 1] and permutation invariance, as noted e.g. by Rousseeuw [?]. Then (b) follows directly, for $x_1 = n$, $x_2 = \cdots = x_n = 0$, $n \to \infty$. □
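The counterexample in part (b) is easy to tabulate. The sketch below works in one coordinate only, which is an assumption of this illustration (the other coordinates can be taken to be 0); the function name `empirical` is likewise hypothetical.

```python
def empirical(n):
    """Empirical measure of x_1 = n, x_2 = ... = x_n = 0 in one coordinate.
    Returns (sample mean, mass at 0)."""
    points = [float(n)] + [0.0] * (n - 1)
    return sum(points) / n, points.count(0.0) / n

table = {n: empirical(n) for n in (10, 1000, 100000)}
```

The mass at 0 is $(n-1)/n \to 1$, so $P_n \to \delta_0$ weakly, while the sample mean stays equal to 1 and so does not converge to the mean 0 of $\delta_0$.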

Next is a related fact about scatter functionals. Davies [?, p. 1879] made a statement closely related to part (b), strong but not quite in the same generality, and very briefly suggested a proof. I don't know a reference for part (a), or an explicit one for (b), so a proof will be given.

Theorem 5. (a) Let $\Sigma(\cdot)$ be a singularly affine equivariant scatter functional defined on all empirical measures $P_n$ on $\mathbb{R}^d$ for $d \ge 2$ and some fixed $n \ge 2$. Write $\Sigma(X) := \Sigma(P_n)$. Then there is a constant $c_n \ge 0$, depending on $\Sigma(\cdot)$, such that for any $X$, $\Sigma(X - \bar{X}1_n') = c_n(X - \bar{X}1_n')(X - \bar{X}1_n')'$. In other words, applied to centered data matrices, $\Sigma$ is proportional to the sample covariance matrix.
(b) If $\Sigma(\cdot)$ is an affinely equivariant scatter functional defined for all $n$ and $P_n$ on $\mathbb{R}^d$ for $d \ge 2$, weakly continuous as a function of $P_n$, then $\Sigma \equiv 0$.

Proof. (a) We have $\Sigma(BX) = B\Sigma(X)B'$ for any $d \times d$ matrix $B$. For any $U, V \in \mathbb{R}^n$ let $X^1 = U'$, $X^2 = V'$, and $(U, V) := \Sigma_{12}(X)$. Then $(\cdot, \cdot)$ is well-defined, letting $B_{11} = B_{22} = 1$ and $B_{ij} = 0$ otherwise. It will be shown that $(\cdot, \cdot)$ is a semi-inner product. We have $(U, V) = (V, U)$ via $B$ with $B_{12} = B_{21} = 1$ and $B_{ij} = 0$ otherwise, since $\Sigma$ is symmetric. For $B_{11} = B_{21} = 1$ and $B_{ij} = 0$ otherwise we get for any $U \in \mathbb{R}^n$ that

(4) $(U, U) = \Sigma_{12}(BX) = (B\Sigma(X)B')_{12} = \Sigma_{11}(X) \ge 0$.

For constants $a$ and $b$, $(aU, bV) = ab(U, V)$ follows for $B_{11} = a$, $B_{22} = b$, and $B_{ij} = 0$ otherwise. It remains to prove biadditivity $(U, V + W) = (U, V) + (U, W)$. For $d \ge 3$ this is easy, letting $X^3 = W'$, $B_{11} = B_{22} = B_{23} = 1$, and $B_{ij} = 0$ otherwise. For $d = 2$, we first get $(U + V, V) = (U, V) + (V, V)$ from $B = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$. Symmetrically, $(U, U + V) = (U, U) + (U, V)$. Then from $B = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$ we get

(5) $(U + V, U + V) = (U, U) + 2(U, V) + (V, V)$.

Letting $\|W\|^2 := (W, W)$ for any $W \in \mathbb{R}^n$ we get the parallelogram law $\|U + V\|^2 + \|U - V\|^2 = 2\|U\|^2 + 2\|V\|^2$. (But $\|\cdot\|$ has not yet been shown to be a norm.) Applying this repeatedly we get for any $W$, $Y$, and $Z \in \mathbb{R}^n$ that

$\|W + Y + Z\|^2 - \|W - Y - Z\|^2 = \|W + Y\|^2 - \|W - Y\|^2 + \|W + Z\|^2 - \|W - Z\|^2$,

letting first $U = W + Y$, $V = Z$, then $U = W - Z$, $V = Y$, then $U = W$, $V = Z$, and lastly $U = W$, $V = Y$. Applying (5) gives $(W, Y + Z) = (W, Y) + (W, Z)$, the desired biadditivity. So $(\cdot, \cdot)$ is indeed a semi-inner product, i.e. there is a $C(n) \in \mathcal{N}_n$ such that $(U, V) = U'C(n)V$. By permutation invariance, there are numbers $a_n \ge 0$ and $b_n$ such that $C(n)_{ii} = a_n$ for all $i = 1, \ldots, n$ and $C(n)_{ij} = b_n$ for all $i \ne j$.

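Let $c_n := a_n - b_n$ and let $e_i$ be the $i$th standard unit vector in $\mathbb{R}^n$. For each $y \in \mathbb{R}^n$ let $y = \sum_{i=1}^n y_i e_i$ and $\bar{y} := \frac{1}{n}\sum_{i=1}^n y_i$. Then for any $z \in \mathbb{R}^n$,

$(y - \bar{y}1_n,\ z - \bar{z}1_n) = \sum_{i,j=1}^n C(n)_{ij}(y_i - \bar{y})(z_j - \bar{z}) = c_n (y - \bar{y}1_n)'(z - \bar{z}1_n)$.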

For $1 \le j \le k \le d$, let $B_{ir} := \delta_{r\pi(i)}$ for a function $\pi$ from $\{1, 2, \ldots, d\}$ into itself with $\pi(1) = j$ and $\pi(2) = k$. Then $(BX)^1 = X^j$ and $(BX)^2 = X^k$. Thus $(X^j, X^k) = \Sigma_{12}(BX) = \Sigma_{jk}(X)$, recalling (4) if $j = k$.

Let $\bar{X}$ have $i$th component $\bar{X}^i$. Then

$\Sigma_{jk}(X - \bar{X}1_n') = (X^j - \bar{X}^j 1_n,\ X^k - \bar{X}^k 1_n) = c_n (X^j - \bar{X}^j 1_n)'(X^k - \bar{X}^k 1_n)$,

where $c_n \ge 0$ is seen when $j = k$ and the coefficient of $c_n$ is strictly positive, as it can be since $n \ge 2$. Thus part (a) is proved.

For part (b), consider empirical measures $P_n = P_{mn}$, so that each $X_j$ in $P_n$ is repeated $m$ times in $P_{mn}$. Since the $\bar{X}$'s and $\Sigma$'s for $P_n$ and $P_{mn}$ must be the same, we get that $c_{mn} = c_n/m$, which likewise equals $c_m/n$. Thus there is a constant $c_1$ such that $c_n = c_1/n$ for all $n$.

Let $X_{11} := -X_{12} := \sqrt{n}$, let $X_{ij} = 0$ for all other $i, j$ and let $n \to \infty$. Then $\bar{X} = 0$, $P_n \to \delta_0$ weakly, and $\Sigma(\delta_0)$ is the 0 matrix by singular affine equivariance with $B = 0$, but $\Sigma(P_n)$ don't converge to 0 unless $c_1 = 0$ and so $c_n = 0$ for all $n$, proving (b). □
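The scaling in the last step can be checked numerically. The sketch assumes the two nonzero entries are $\pm\sqrt{n}$, which is the size the argument needs for $\Sigma(P_n)$ to stay away from 0 while $P_n \to \delta_0$; the function names and the default $c_1 = 1$ are illustrative.

```python
import math

def scaled_cov_11(n, c1=1.0):
    """(c1/n) times the (1,1) entry of (X - mean 1')(X - mean 1')' when
    X_11 = sqrt(n), X_12 = -sqrt(n), and every other entry of X is 0."""
    row = [math.sqrt(n), -math.sqrt(n)] + [0.0] * (n - 2)
    mean = sum(row) / n
    return (c1 / n) * sum((x - mean) ** 2 for x in row)

def mass_at_origin(n):
    """P_n({0}) for this data matrix: n - 2 of the n columns are 0."""
    return (n - 2) / n

values = {n: scaled_cov_11(n) for n in (4, 7, 100, 10000)}
```

For every $n$ the value is $2c_1$, so $\Sigma_{11}(P_n) = 2c_1$ cannot tend to $\Sigma_{11}(\delta_0) = 0$ unless $c_1 = 0$.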

So, for $d \ge 2$, affinely equivariant location and non-zero scatter functionals, weakly continuous on their domains, can't be defined on all laws. They can be defined on weakly dense and open domains, as will be seen in Theorem 12, on which they can have good differentiability properties, as seen in Section 5.

3. Multivariate scatter 

This section treats pure scatter in $\mathbb{R}^d$, with $\Theta = \mathcal{P}_d$. Results of Kent and Tyler [15] for finite samples, to be recalled, are extended to general laws on $\mathbb{R}^d$ in [6, Section 3].

For $A \in \mathcal{P}_d$ and a function $\rho$ from $[0, \infty)$ into itself, consider the function

(6) $L(y, A) := \frac{1}{2}\log\det A + \rho(y'A^{-1}y), \quad y \in \mathbb{R}^d$.

For adjustment, let

(7) $h(y, A) := L(y, A) - L(y, I)$

where $I$ is the identity matrix. Then

(8) $Qh(A) = \frac{1}{2}\log\det A + \int \rho(y'A^{-1}y) - \rho(y'y)\,dQ(y)$

if the integral is defined. We have the following, shown for $Q = Q_n$ an empirical measure in [15, (1.3)] and for general $Q$ in [6, Section 3]. Here (9) is a redescending condition. A symmetric $d \times d$ matrix $A$ will be parameterized by the entries $A_{ij}$ for $1 \le i \le j \le d$. Thus in taking a partial derivative of a function $f(A)$ with respect to an entry $A_{ij}$, $A_{ji} \equiv A_{ij}$ will vary while $A_{kl}$ will remain fixed except for $(k, l) = (i, j)$ or $(j, i)$.

Proposition 6. Let $\rho$ be continuous from $[0, \infty)$ into itself and have a bounded continuous derivative, where $\rho'(0) := \rho'(0+) := \lim_{x \downarrow 0}[\rho(x) - \rho(0)]/x$. Let $0 \le u(x) := 2\rho'(x)$ for $x \ge 0$. Assume that

(9) $\sup_{0 \le x < \infty} x\,u(x) < \infty$.

Then for each law $Q$ on $\mathbb{R}^d$, $Qh$ in (8) is well defined and is a $C^1$ function of the entries of $A$. Here $Qh$ has a critical point at $A = B$ if and only if

(10) $B = \int u(y'B^{-1}y)\,yy'\,dQ(y)$.
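Equation (10) suggests iterating its right side as a fixed-point map. The sketch below does this in $\mathbb{R}^2$ for an empirical measure on six points, using the weight $u(s) = (\nu + d)/(\nu + s)$ of the multivariate t family. The iteration scheme, the data, the choice $\nu = 3$, and all function names are assumptions of this illustration and are not taken from the paper.

```python
def t_weight(s, nu, d):
    """u(s) = (nu + d)/(nu + s), the multivariate t weight function."""
    return (nu + d) / (nu + s)

def quad_form_inv(B, y):
    """y' B^{-1} y for a symmetric positive definite 2 x 2 matrix B."""
    (a, b), (_, c) = B
    det = a * c - b * b
    return (c * y[0] ** 2 - 2 * b * y[0] * y[1] + a * y[1] ** 2) / det

def scatter_step(B, points, nu):
    """Right side of (10) for the empirical measure of `points` in R^2."""
    n = len(points)
    w = [t_weight(quad_form_inv(B, y), nu, 2) for y in points]
    return [[sum(wi * y[i] * y[j] for wi, y in zip(w, points)) / n
             for j in range(2)] for i in range(2)]

def scatter_fixed_point(points, nu, iterations=500):
    """Iterate B <- (right side of (10)), starting from the identity."""
    B = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(iterations):
        B = scatter_step(B, points, nu)
    return B

points = [(-1, 0), (-1, 0), (1, 0), (1, 0), (0, -1), (0, 1)]
B = scatter_fixed_point(points, nu=3.0)
residual = scatter_step(B, points, 3.0)   # equals B at a solution of (10)
```

For this data (10) decouples into two scalar equations that can be solved by hand, giving $B = \mathrm{diag}(7/9, 2/9)$ when $\nu = 3$; the iterates approach that matrix.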
The following, proved in [6, Section 3], extends to any law $Q$ the uniqueness part of [15, Theorem 2.2].

Proposition 7. Under the hypotheses of Proposition 6, if in addition $u(\cdot)$ is nonincreasing and $s \mapsto su(s)$ is strictly increasing on $[0, \infty)$, then for any law $Q$ on $\mathbb{R}^d$, $Qh$ has at most one critical point $A \in \mathcal{P}_d$.

A sufficient condition for existence of a pure scatter M-functional $A(Q)$ will include the following assumption from [15, (2.4)]. Given a function $u(\cdot)$ as in Proposition 7, let $a_0 := a_0(u(\cdot)) := \sup_{s > 0} su(s)$. Since $s \mapsto su(s)$ is increasing, it follows that

(11) $su(s) \uparrow a_0$ as $s \uparrow +\infty$.

Kent and Tyler [15] gave the following condition for empirical measures.

Definition. Given $a_0 := a(0) > 0$, let $\mathcal{U}_{d,a(0)}$ denote the set of all laws $Q$ on $\mathbb{R}^d$ such that for every proper linear subspace $H$ of $\mathbb{R}^d$, of dimension $q \le d - 1$, we have $Q(H) < 1 - (d - q)/a_0$.

Note that $\mathcal{U}_{d,a(0)}$ is weakly open and dense and contains all laws with densities. If $Q \in \mathcal{U}_{d,a(0)}$, then $Q(\{0\}) < 1 - (d/a_0)$, which is impossible if $a_0 \le d$. So in the next theorem we assume $a_0 > d$. In part (b), the existence of a unique $B(Q_n)$ minimizing $Q_nh$ for an empirical $Q_n \in \mathcal{U}_{d,a(0)}$ was proved in [15, Theorems 2.1 and 2.2]. For a general $Q \in \mathcal{U}_{d,a(0)}$ it's proved in [6, Section 3]; one lemma useful in the proof is proved here.

Theorem 8. Under the assumptions of Propositions 6 and 7, for $a(0) = a_0$ as in (11),

(a) If $Q \notin \mathcal{U}_{d,a(0)}$, then $Qh$ has no critical points.

(b) If $a_0 > d$ and $Q \in \mathcal{U}_{d,a(0)}$, then $Qh$ attains its minimum at a unique $B = B(Q) \in \mathcal{P}_d$ and has no other critical points.

A proof of the theorem uses a fact about probabilities of proper subspaces or hyperplanes. A related statement is Lemma 5.1 of Dümbgen and Tyler [10].

Lemma 9. Let $V$ be a real vector space with a $\sigma$-algebra $\mathcal{B}$ for which all finite-dimensional hyperplanes $H = x + T := \{x + u : u \in T\}$ for finite-dimensional vector subspaces $T$ are measurable. Let $Q$ be a probability measure on $\mathcal{B}$ and let $\mathcal{H}_j$ be the collection of all $j$-dimensional hyperplanes in $V$. Then for each $j = 0, 1, 2, \ldots$, for any infinite sequence $\{C_i\}$ of distinct hyperplanes in $\mathcal{H}_j$ such that $Q(C_i)$ converges, its limit must be $Q(F)$ for some hyperplane $F$ of dimension less than $j$ such that $F \subset C_i$ for infinitely many $i$. In particular, $Q(C_i)$ cannot be strictly increasing. The same is true for vector subspaces in place of hyperplanes.

Proof. Hyperplanes of dimension 0 are singletons $\{x\}$. The empty set will be considered as a hyperplane of dimension $-1$. Let $W_{-1} := \emptyset$. Claim 1: For each $j = 0, 1, \ldots$, there exists a finite or countable sequence $\{V_{ji}\} \subset \mathcal{H}_j$ such that for $W_j := W_{j-1} \cup \bigcup_i V_{ji}$, $Q(H \setminus W_j) = 0$ for all $H \in \mathcal{H}_j$. Let $V_{0i} = \{x_i\}$ for some unique $i$ if and only if $Q(\{x_i\}) > 0$. The set of such $x_i$ is clearly countable. Let $W_0 := \bigcup_i V_{0i} = \{x \in V : Q(\{x\}) > 0\}$. Clearly, for any $x \in V$, $Q(\{x\} \setminus W_0) = 0$. Recursively, for $j \ge 1$, assuming $W_{j-1}$ has the given properties, suppose for $r = 1, 2$, $H_r \in \mathcal{H}_j$ and $Q(H_r \setminus W_{j-1}) > 0$. If $H_1 \ne H_2$, then $H_1 \cap H_2$ is a hyperplane of dimension at most $j - 1$, so $Q(H_1 \cap H_2 \setminus W_{j-1}) = 0$ and the sets $H_r \setminus W_{j-1}$ are disjoint up to sets with $Q = 0$. Thus there are at most countably many different $H_r \in \mathcal{H}_j$ with $Q(H_r \setminus W_{j-1}) > 0$. Let $V_{jr} := H_r$ for such $H_r$ and set $W_j := W_{j-1} \cup \bigcup_r V_{jr}$. It's then clear that for any $H \in \mathcal{H}_j$, $Q(H \setminus W_j) = 0$, so the recursion can continue and Claim 1 is proved.

Claim 2 is that if $C$ is any hyperplane of dimension $j$ or larger, and $s = 0, 1, \ldots, j$, then for each $r$, either $C \supset V_{sr}$ or $Q(C \cap (V_{sr} \setminus W_{s-1})) = 0$. If $C$ doesn't include $V_{sr}$, then $C \cap V_{sr}$ is a hyperplane of dimension $\le s - 1$, and so included in $W_{s-1}$ up to a set with $Q = 0$, so Claim 2 follows.

Now, given distinct $C_i \in \mathcal{H}_j$ with $Q(C_i)$ converging, let $B$ be a hyperplane of largest possible dimension $b$ included in $C_i$ for infinitely many $i$. Then $b < j$. Taking a subsequence, we can assume that $B \subset C_i$ for all $i$. Claim 3 is that then $Q(C_i \setminus B) \to 0$ as $i \to \infty$. For any $s = 0, 1, \ldots, j - 1$, and each $r$, by Claim 2, if $C_i \supset V_{sr}$ for infinitely many $i$, then $V_{sr} \subset B$, since otherwise $C_i$ includes the smallest hyperplane including $V_{sr}$ and $B$, which has dimension larger than $b$, a contradiction. So $\lim_{i \to \infty} Q((C_i \setminus B) \cap (V_{sr} \setminus W_{s-1})) = 0$ for each $s < j$ and $r$. It follows by induction on $s$ that $Q(C_i \cap W_s \setminus B) \to 0$ as $i \to \infty$ for $s = 0, 1, \ldots, j - 1$.

By the proof of Claim 1, the sets $C_i \setminus W_{j-1}$ are disjoint up to sets with $Q = 0$, so Claim 3 follows, and so the statement of the lemma for hyperplanes. The proof for vector subspaces is parallel and easier. The fact that $Q(C_i)$ cannot be strictly increasing then clearly follows, as a subsequence would also be strictly increasing. So the lemma is proved. □

Dümbgen and Tyler [10, Lemma 5.1] show that $\sup\{Q(V) : V \in \mathcal{H}_j\}$ is attained for each $Q$ and $j$ and is weakly upper semicontinuous in $Q$.

4. Location and scatter t functionals

As Kent and Tyler [15, Section 3] and Kent, Tyler and Vardi [16] showed, (t) location-scatter estimation in $\mathbb{R}^d$ can be reduced to pure scatter estimation in $\mathbb{R}^{d+1}$, beginning with the following.

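Proposition 10. (i) For any $d = 1, 2, \ldots$, there is a 1-1 correspondence, $C^\infty$ in either direction, between matrices $A \in \mathcal{P}_{d+1}$ and triples $(\Sigma, \mu, \gamma)$ where $\Sigma \in \mathcal{P}_d$, $\mu \in \mathbb{R}^d$, and $\gamma > 0$, given by

(12) $A = A(\Sigma, \mu, \gamma) = \gamma \begin{pmatrix} \Sigma + \mu\mu' & \mu \\ \mu' & 1 \end{pmatrix}$.

The same holds for $A \in \mathcal{P}_{d+1}$ with $\gamma = A_{d+1,d+1} = 1$ and pairs $(\mu, \Sigma) \in \mathbb{R}^d \times \mathcal{P}_d$.
(ii) If (12) holds, then for any $y \in \mathbb{R}^d$ (a column vector),

(13) $(y', 1)A^{-1}(y', 1)' = \gamma^{-1}\left(1 + (y - \mu)'\Sigma^{-1}(y - \mu)\right)$.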
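Identity (13) can be verified numerically for particular values. In this sketch the values of $\Sigma$, $\mu$, $\gamma$, and $y$ are arbitrary choices for $d = 2$, and `solve`, `quad_inv`, and `build_A` are hypothetical helper names; quadratic forms in an inverse matrix are computed by solving a linear system rather than by inverting.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def quad_inv(A, z):
    """z' A^{-1} z."""
    return sum(zi * xi for zi, xi in zip(z, solve(A, z)))

def build_A(Sigma, mu, gamma):
    """The (d+1) x (d+1) matrix A(Sigma, mu, gamma) of (12)."""
    d = len(mu)
    top = [[gamma * (Sigma[i][j] + mu[i] * mu[j]) for j in range(d)] + [gamma * mu[i]]
           for i in range(d)]
    return top + [[gamma * m for m in mu] + [gamma]]

Sigma = [[2.0, 0.5], [0.5, 1.0]]
mu = [1.0, -2.0]
gamma = 3.0
y = [0.3, 4.0]

A = build_A(Sigma, mu, gamma)
lhs = quad_inv(A, y + [1.0])                      # (y', 1) A^{-1} (y', 1)'
diff = [yi - mi for yi, mi in zip(y, mu)]
rhs = (1.0 + quad_inv(Sigma, diff)) / gamma       # right side of (13)
```

The two sides agree up to rounding. The reason is the factorization $A = \gamma L\,\mathrm{diag}(\Sigma, 1)\,L'$ with $L = \begin{pmatrix} I & \mu \\ 0 & 1 \end{pmatrix}$, for which $L^{-1}(y', 1)' = ((y - \mu)', 1)'$.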

For M-estimation of location and scatter in $\mathbb{R}^d$, we will have a function $\rho : [0, \infty) \to [0, \infty)$ as in the previous section. The parameter space is now the set of pairs $(\mu, \Sigma)$ for $\mu \in \mathbb{R}^d$ and $\Sigma \in \mathcal{P}_d$, and we have a multivariate $\rho$ function

$\rho(y, (\mu, \Sigma)) := \frac{1}{2}\log\det\Sigma + \rho((y - \mu)'\Sigma^{-1}(y - \mu))$.

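For any $\mu \in \mathbb{R}^d$ and $\Sigma \in \mathcal{P}_d$ let $A_0 := A_0(\mu, \Sigma) := A(\Sigma, \mu, 1) \in \mathcal{P}_{d+1}$ by (12) with $\gamma = 1$, noting that $\det A_0 = \det\Sigma$. Now $\rho$ can be adjusted, in light of (7) and (13), by defining

(14) $h(y, (\mu, \Sigma)) := \rho(y, (\mu, \Sigma)) - \rho(y'y)$.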

Laws $P$ on $\mathbb{R}^d$ correspond to laws $Q := P \circ T_1^{-1}$ on $\mathbb{R}^{d+1}$ concentrated in $\{y : y_{d+1} = 1\}$, where $T_1(y) := (y', 1)' \in \mathbb{R}^{d+1}$, $y \in \mathbb{R}^d$. We will need a hypothesis on $P$ corresponding to $Q \in \mathcal{U}_{d+1,a(0)}$. Kent and Tyler gave these conditions for empirical measures.

Definition. For any $a_0 > 0$ let $\mathcal{V}_{d,a(0)}$ be the set of all laws $P$ on $\mathbb{R}^d$ such that $P(J) < 1 - (d - q)/a_0$ for every affine hyperplane $J$ of dimension $q < d$.

The next fact is rather easy to prove. Here $a > d + 1$ avoids the contradictory $Q(\{0\}) < 0$.

Proposition 11. If $P$ is a law on $\mathbb{R}^d$, $a > d + 1$, and $Q := P \circ T_1^{-1}$ on $\mathbb{R}^{d+1}$, then $P \in \mathcal{V}_{d,a}$ if and only if $Q \in \mathcal{U}_{d+1,a}$.

A family of $\rho$ functions for which $\gamma = 1$ automatically, as noted by Kent and Tyler [15, (1.5), (1.6), Section 4], is given by elliptically symmetric multivariate t densities with $\nu$ degrees of freedom as follows: for $0 < \nu < \infty$ and $0 \le s < \infty$ let

(15) $\rho_\nu(s) := \rho_{\nu,d}(s) := \frac{\nu + d}{2}\log\left(\frac{\nu + s}{\nu}\right)$.

For this $\rho$, $u$ is $u_\nu(s) := u_{\nu,d}(s) := (\nu + d)/(\nu + s)$, which is decreasing, and $s \mapsto su_{\nu,d}(s)$ is strictly increasing and bounded, i.e. (9) holds, with supremum and limit at $+\infty$ equal to $a_{0,\nu} := a_0(u_\nu(\cdot)) = \nu + d$.
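These properties of $u_{\nu,d}$ are elementary, and a quick tabulation makes them concrete. The values $\nu = 2.5$, $d = 3$ and the logarithmic grid are arbitrary choices for this sketch.

```python
def u(s, nu, d):
    """Weight function u_{nu,d}(s) = (nu + d)/(nu + s)."""
    return (nu + d) / (nu + s)

nu, d = 2.5, 3
grid = [10.0 ** k for k in range(-3, 9)]        # s from 1e-3 up to 1e8
u_values = [u(s, nu, d) for s in grid]          # should decrease in s
su_values = [s * u(s, nu, d) for s in grid]     # should increase toward nu + d
```

Since $su_{\nu,d}(s) = (\nu + d)\,(1 - \nu/(\nu + s))$, the second list increases strictly and stays below $\nu + d = 5.5$, approaching it as $s \to \infty$.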

The following fact is in part given by Kent and Tyler [15] and further by Kent, Tyler and Vardi [16], for empirical measures; equation (16) was not found explicitly in either. Here a proof will be given for any $P \in \mathcal{V}_{d,\nu+d}$, assuming Theorem 8 and Propositions 6 and 10.

Theorem 12. For any $d = 1, 2, \ldots$, $\nu > 1$, law $P$ on $\mathbb{R}^d$, and $Q = P \circ T_1^{-1}$ on $\mathbb{R}^{d+1}$, letting $\nu' := \nu - 1$, assuming $P \in \mathcal{V}_{d,\nu+d}$ in parts (a) through (e),

(a) For $A \in \mathcal{P}_{d+1}$, $A \mapsto Qh(A)$ defined by (8) for $\rho = \rho_{\nu',d+1}$ has a unique critical point $A(\nu') := A_{\nu'}(Q)$ which is an absolute minimum;

(b) $A(\nu')_{d+1,d+1} = \int u_{\nu',d+1}(z'A(\nu')^{-1}z)\,dQ(z) = 1$;

(c) For any $\mu \in \mathbb{R}^d$ and $\Sigma \in \mathcal{P}_d$ let $A = A(\Sigma, \mu, 1) \in \mathcal{P}_{d+1}$ in (12). Then for any $y \in \mathbb{R}^d$ and $z := (y', 1)'$, we have

(16) $u_{\nu',d+1}(z'A^{-1}z) \equiv u_{\nu,d}((y - \mu)'\Sigma^{-1}(y - \mu))$.

In particular, this holds for $A = A(\nu')$ and its corresponding $\mu = \mu_\nu \in \mathbb{R}^d$ and $\Sigma = \Sigma_\nu \in \mathcal{P}_d$.

(d)

(17) $\int u_{\nu,d}((y - \mu_\nu)'\Sigma_\nu^{-1}(y - \mu_\nu))\,dP(y) = 1$.

(e) For $h$ defined by (14) with $\rho = \rho_{\nu,d}$, $(\mu_\nu, \Sigma_\nu)$ is the M-functional $\theta_1$ for $P$.

(f) If, on the other hand, $P \notin \mathcal{V}_{d,\nu+d}$, then $(\mu, \Sigma) \mapsto Ph(\mu, \Sigma)$ for $h$ as in (e) has no critical points.

Proof. (a): Theorem 8(b) applies since $u_{\nu',d+1}$ satisfies its hypotheses, with $a_0(u_{\nu',d+1}) = \nu' + d + 1 = \nu + d > d + 1$.

(b): By (10), multiplying by $A(\nu')^{-1}$ and taking the trace gives

$d + 1 = \int u_{\nu',d+1}(z'A(\nu')^{-1}z)\,z'A(\nu')^{-1}z\,dQ(z)$.

We also have, since $z_{d+1} \equiv 1$, that $A(\nu')_{d+1,d+1} = \int u_{\nu',d+1}(z'A(\nu')^{-1}z)\,dQ(z)$. For any $s \ge 0$, we have $su_{\nu',d+1}(s) + \nu' u_{\nu',d+1}(s) = \nu + d$. Combining gives

$d + 1 = \nu + d - \nu' \int u_{\nu',d+1}(z'A(\nu')^{-1}z)\,dQ(z)$,

and (b) follows.
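The algebra behind part (b) can be spelled out as a short worked derivation. It uses only (10), the identity $(s + \nu')u_{\nu',d+1}(s) = \nu + d$, and $\nu' = \nu - 1 > 0$. The abbreviations $s(z)$ and $u$ are introduced here for readability and are not notation from the paper.

```latex
\begin{align*}
s(z) &:= z'A(\nu')^{-1}z, \qquad u := u_{\nu',d+1}, \qquad
  (s + \nu')\,u(s) = \nu' + d + 1 = \nu + d, \\
d + 1 &= \int s(z)\,u(s(z))\,dQ(z)
       = \int \bigl[(\nu + d) - \nu'\,u(s(z))\bigr]\,dQ(z)
       = \nu + d - \nu' \int u(s(z))\,dQ(z), \\
\nu' \int u(s(z))\,dQ(z) &= \nu - 1 = \nu'
  \quad\Longrightarrow\quad
  A(\nu')_{d+1,d+1} = \int u(s(z))\,dQ(z) = 1 .
\end{align*}
```

The last implication divides by $\nu'$, which is where the assumption $\nu > 1$ enters.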

(c): We can just apply (13) with $\gamma = 1$, and for $A = A(\nu')$, part (b).

(d): This follows from (b) and (c).

(e): By Proposition 10, for $\gamma = 1$ fixed, the relation (12) is a homeomorphism between $\{A \in \mathcal{P}_{d+1} : A_{d+1,d+1} = 1\}$ and $\{(\mu, \Sigma) : \mu \in \mathbb{R}^d,\ \Sigma \in \mathcal{P}_d\}$. So this also follows from Theorem 8.

(f): We have $\nu + d > d + 1$, so $Q \notin \mathcal{U}_{d+1,\nu+d}$ by Proposition 11. By Theorem 8(a), $Qh$ defined by (8) for $\rho = \rho_{\nu',d+1}$ has no critical point $A$. Suppose $Ph$ has a critical point $(\mu, \Sigma)$ for $\rho = \rho_{\nu,d}$. Let $A := A(\Sigma, \mu, 1) \in \mathcal{P}_{d+1}$. By an affine transformation we can assume $\mu = 0$ and $\Sigma = I_d$, the $d \times d$ identity matrix, so $A = I_{d+1}$. Equations for $\Sigma = I_d$ to be a critical point can be written in the form $\partial/\partial(\Sigma^{-1})_{ij} = 0$, $1 \le i \le j \le d$. By (16) it follows easily that equation (10) holds for $B = A$ and $u = u_{\nu',d+1}$ with the possible exception of the $(d + 1, d + 1)$ entry. Summing the equations for the diagonal $(i, i)$ entries for $i = 1, \ldots, d$, it follows that the $(d + 1, d + 1)$ equation and so (10) holds. By Proposition 6 we get that $A$ is a critical point of the given $Qh$, a contradiction. □



Kent, Tyler and Vardi [16, Theorem 3.1] show that if $u(s) \ge 0$, $u(0) < +\infty$, $u(\cdot)$ is continuous and nonincreasing for $s \ge 0$, and $su(s)$ is nondecreasing for $s \ge 0$, up to $a_0 > d$, and if (17) holds with $u$ in place of $u_{\nu,d}$ at each critical point $(\mu, \Sigma)$ of $Qh$, then $u$ must be of the form $u(s) = u_{\nu,d}(s) = (\nu + d)/(\nu + s)$ for some $\nu > 0$. Thus, the method of relating pure scatter functionals in $\mathbb{R}^{d+1}$ to location-scatter functionals in $\mathbb{R}^d$ given by Theorem 12 for t functionals defined by functions $u_{\nu,d}$ does not extend directly to any other functions $u$.

When $d = 1$, $P \in \mathcal{V}_{1,\nu+1}$ requires that $P(\{x\}) < \nu/(1 + \nu)$ for each point $x$. Then $\Sigma$ reduces to a number $\sigma^2$ with $\sigma > 0$. If $\nu > 1$, and $P \notin \mathcal{V}_{1,\nu+1}$, then for some unique $x$, $P(\{x\}) \ge \nu/(\nu + 1)$. One can extend $(\mu_\nu, \sigma_\nu)$ by setting $\mu_\nu(P) := x$ and $\sigma_\nu(P) := 0$, with $(\mu_\nu, \sigma_\nu)$ then being weakly continuous at all $P$ [?, Section 6].

The $t_\nu$ functionals $(\mu_\nu, \Sigma_\nu)$ defined in this section can't have a weakly continuous extension to all laws for $d \ge 2$, because such an extension of $\mu_\nu$ would give a weakly continuous affinely equivariant location functional defined for all laws, which is impossible by Theorem 4(b). Here is an example showing that for $d = 2$ and empirical laws with $n = 6$, invariant under interchanging $x$ and $-x$, and/or $y$ and $-y$, so that an affinely equivariant $\mu$ must be 0, there is no continuous extension of the scatter matrix $\Sigma_\nu$ to laws concentrated in lines. For $k = 1, 2, \ldots$ let

$P^{(k)} := \frac{1}{6}\left[\delta_{(-1,-1/k)} + \delta_{(-1,1/k)} + \delta_{(1,-1/k)} + \delta_{(1,1/k)} + 2\delta_{(0,0)}\right]$,

$Q^{(k)} := \frac{1}{6}\left[2\delta_{(-1,0)} + \delta_{(0,-1/k)} + \delta_{(0,1/k)} + 2\delta_{(1,0)}\right]$.

Then for each $\nu > 1$, all members of both sequences have mass $\le 1/3 < \nu/(\nu + 2)$ at each point and mass $\le 2/3 < (\nu + 1)/(\nu + 2)$ on each line, so are in $\mathcal{V}_{2,\nu+2}$ and the functionals $\Sigma_\nu$ are defined for them. By the symmetries $x \leftrightarrow -x$ and $y \leftrightarrow -y$, $\mu_\nu \equiv 0$ and $\Sigma_\nu$ is diagonal on all these laws. Both sequences converge to the limit $P = \frac{1}{3}\left[\delta_{(-1,0)} + \delta_{(0,0)} + \delta_{(1,0)}\right]$, which is concentrated in a line and so is not in $\mathcal{V}_{2,\nu+2}$ for any $\nu$. $\Sigma_\nu(P^{(k)})$ converges to $\begin{pmatrix} a(\nu) & 0 \\ 0 & 0 \end{pmatrix}$ but $\Sigma_\nu(Q^{(k)})$ converges to $\begin{pmatrix} b(\nu) & 0 \\ 0 & 0 \end{pmatrix}$, where $a(\nu) := 2(1 - \nu^{-1})/3 \ne b(\nu) := (2 + \nu^{-1})/3$. We also have $\Sigma_\nu(Q^{(1)}) = \begin{pmatrix} b(\nu) & 0 \\ 0 & c(\nu) \end{pmatrix}$ with $c(\nu) = \frac{1}{3}(1 - \nu^{-1})$, so that, in contrast to Theorem 5(a), $\Sigma_\nu$ is not proportional to the covariance matrix $\begin{pmatrix} 2/3 & 0 \\ 0 & 1/3 \end{pmatrix}$ for any $\nu < \infty$, but $\Sigma_\nu$ converges to the covariance as $\nu \to +\infty$, as is not surprising since the $t_\nu$ distribution converges to a normal one.
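The two limits in this example can be reproduced numerically. The sketch below iterates the location-scatter equations with weights $u_{\nu,2}$ as a fixed-point scheme; that scheme, the starting point, the values $\nu = 3$ and $k = 50$, and the function names are assumptions of this illustration and are not the paper's method.

```python
def t_location_scatter(points, nu, iterations=400):
    """Fixed-point iteration for the t_nu location-scatter equations in R^2:
    mu = sum(w y)/sum(w), Sigma = mean(w (y - mu)(y - mu)'),
    with w = (nu + 2)/(nu + (y - mu)' Sigma^{-1} (y - mu))."""
    n = len(points)
    mu = [sum(p[i] for p in points) / n for i in range(2)]
    S = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(iterations):
        det = S[0][0] * S[1][1] - S[0][1] ** 2
        w = []
        for p in points:
            a, b = p[0] - mu[0], p[1] - mu[1]
            s = (S[1][1] * a * a - 2 * S[0][1] * a * b + S[0][0] * b * b) / det
            w.append((nu + 2) / (nu + s))
        mu = [sum(wi * p[i] for wi, p in zip(w, points)) / sum(w) for i in range(2)]
        S = [[sum(wi * (p[i] - mu[i]) * (p[j] - mu[j]) for wi, p in zip(w, points)) / n
              for j in range(2)] for i in range(2)]
    return mu, S

def P_k(k):
    h = 1.0 / k
    return [(-1, -h), (-1, h), (1, -h), (1, h), (0, 0), (0, 0)]

def Q_k(k):
    h = 1.0 / k
    return [(-1, 0), (-1, 0), (1, 0), (1, 0), (0, -h), (0, h)]

nu = 3.0
a_nu = 2 * (1 - 1 / nu) / 3      # a(nu) from the text
b_nu = (2 + 1 / nu) / 3          # b(nu) from the text
mu_P, S_P = t_location_scatter(P_k(50), nu)
mu_Q, S_Q = t_location_scatter(Q_k(50), nu)
```

Solving the equations by hand gives $\Sigma_\nu(P^{(k)}) = \mathrm{diag}(a(\nu), a(\nu)/k^2)$ and $\Sigma_\nu(Q^{(k)}) = \mathrm{diag}(b(\nu), c(\nu)/k^2)$ for every $k$, so the $(1,1)$ entries computed here should match $a(\nu) = 4/9$ and $b(\nu) = 7/9$ while the $(2,2)$ entries are of order $k^{-2}$.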



5. Differentiability of t functionals 

Let $(S, e)$ be any separable metric space, in our case $\mathbb{R}^d$ with its usual Euclidean metric. Recall the space $BL(S, e)$ of all bounded Lipschitz real functions on $S$, with its norm $\|\cdot\|_{BL}$. The dual Banach space $BL^*(S, e)$ has the dual norm $\|\cdot\|^*_{BL}$, which metrizes the weak topology on probability measures [?, Theorem 11.3.3].

Let $V$ be an open set in a Euclidean space $\mathbb{R}^d$. For $k = 1, 2, \ldots$, let $C_b^k(V)$ be the space of all real-valued functions $f$ on $V$ such that all partial derivatives $D^p f$, for $D^p := \partial^{[p]}/\partial x_1^{p_1} \cdots \partial x_d^{p_d}$ and $0 \le [p] := p_1 + \cdots + p_d \le k$, are continuous and bounded on $V$. On $C_b^k(V)$ we have the norm

(18) $\|f\|_{k,V} := \sum_{0 \le [p] \le k} \|D^p f\|_{\sup,V}$, where $\|g\|_{\sup,V} := \sup_{x \in V} |g(x)|$.

Then $(C_b^k(V), \|\cdot\|_{k,V})$ is a Banach space. For $k = 1$ and $V$ convex in $\mathbb{R}^d$ it's straightforward that $C_b^1(V)$ is a subspace of $BL(V, e)$, with the same norm for $d = 1$ and an equivalent one if $d > 1$.

Substituting $\rho_{\nu,d}$ from (15) into (6) gives for $y \in \mathbb{R}^d$ and $A \in \mathcal{P}_d$,

$L_{\nu,d}(y, A) := \frac{1}{2}\log\det A + \frac{\nu + d}{2}\log\left[1 + \nu^{-1} y'A^{-1}y\right]$,

so that in (7) we get

$h_\nu(y, A) := h_{\nu,d}(y, A) := L_{\nu,d}(y, A) - L_{\nu,d}(y, I)$.

Differentiating with respect to entries $C_{ij}$ where $C = A^{-1}$, and recalling $u_{\nu,d}(s) \equiv (\nu + d)/(\nu + s)$, we get as shown in [6, Section 5]

(19) $\dfrac{\partial h_{\nu,d}(y, A)}{\partial C_{ij}} = \dfrac{\partial L_{\nu,d}(y, A)}{\partial C_{ij}} = -\dfrac{A_{ij}}{1 + \delta_{ij}} + \dfrac{(\nu + d)\,y_i y_j}{(1 + \delta_{ij})(\nu + y'Cy)}$.
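Formula (19) can be compared with central finite differences. The sketch takes $d = 2$ and treats $C_{ij}$ and $C_{ji}$ as one variable, as in the symmetric parameterization; the point $y$, the matrix $C$, $\nu = 2.5$, the step size, and all function names are choices made for the illustration.

```python
import math

def L(y, C, nu):
    """L_{nu,2}(y, A) in terms of C = A^{-1}: since log det A = -log det C,
    L = -(1/2) log det C + ((nu + 2)/2) log(1 + y'Cy/nu)."""
    det_C = C[0][0] * C[1][1] - C[0][1] ** 2
    q = C[0][0] * y[0] ** 2 + 2 * C[0][1] * y[0] * y[1] + C[1][1] * y[1] ** 2
    return -0.5 * math.log(det_C) + 0.5 * (nu + 2) * math.log(1 + q / nu)

def grad_formula(y, C, nu, i, j):
    """Right side of (19) for d = 2."""
    det_C = C[0][0] * C[1][1] - C[0][1] ** 2
    A = [[C[1][1] / det_C, -C[0][1] / det_C],
         [-C[0][1] / det_C, C[0][0] / det_C]]      # A = C^{-1}
    q = C[0][0] * y[0] ** 2 + 2 * C[0][1] * y[0] * y[1] + C[1][1] * y[1] ** 2
    k = 2.0 if i == j else 1.0                      # 1 + delta_ij
    return -A[i][j] / k + (nu + 2) * y[i] * y[j] / (k * (nu + q))

def grad_numeric(y, C, nu, i, j, eps=1e-6):
    """Central difference in C_ij, moving C_ji together with C_ij."""
    def shifted(t):
        D = [row[:] for row in C]
        D[i][j] += t
        if i != j:
            D[j][i] += t
        return D
    return (L(y, shifted(eps), nu) - L(y, shifted(-eps), nu)) / (2 * eps)

y, nu = (0.7, -1.2), 2.5
C = [[2.0, 0.3], [0.3, 1.5]]
errors = [abs(grad_formula(y, C, nu, i, j) - grad_numeric(y, C, nu, i, j))
          for (i, j) in [(0, 0), (0, 1), (1, 1)]]
```

The factor $1/(1 + \delta_{ij})$ reflects that an off-diagonal entry appears twice in both $\log\det C$ and $y'Cy$, while a diagonal entry appears once.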

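For $0 < \delta < 1$ and $d = 1, 2, \ldots$, let

$\mathcal{W}_\delta := \mathcal{W}_{\delta,d} := \{A \in \mathcal{P}_d : \max(\|A\|, \|A^{-1}\|) < 1/\delta\}$.

The following is proved in [6, Section 5].

Lemma 13. For any $\delta \in (0, 1)$ let $U := U_\delta := \mathbb{R}^d \times \mathcal{W}_{\delta,d}$. Let $A \in \mathcal{P}_d$ be parameterized by the entries $C_{kl}$ of $C = A^{-1}$. For any $\nu > 1$, the functions $\partial L_{\nu,d}/\partial C_{kl}$ in (19) are in $C_b^j(U_\delta)$ for all $j = 1, 2, \ldots$.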

To treat t functionals of location and scatter in any dimension $p$ we will need functionals of pure scatter in dimension $p + 1$, so in the following lemma we only need dimension $d \ge 2$. The next lemma, proved in [6, Section 5], helps to show differentiability of t functionals via implicit function theorems, as it implies that the derivative of the gradient (the Hessian) of $Qh$ is non-singular at a critical point. Let $T(d) := \{(i, j) : 1 \le i \le j \le d\}$.

Lemma 14. For each $\nu > 0$, $d = 2, 3, \ldots$, and $Q \in \mathcal{U}_{d,\nu+d}$, let $A(\nu) = A_\nu(Q) \in \mathcal{P}_d$ be the unique critical point of $Qh(\cdot)$ defined by (8) for $\rho = \rho_{\nu,d}$ defined by (15). For $C = A^{-1}$, the Hessian matrix $\partial^2 Qh(A)/\partial C_{ij}\partial C_{kl}$ with rows indexed by $(i, j) \in T(d)$ and columns by $(k, l) \in T(d)$ is positive definite at $A = A(\nu)$.

For any $\nu > 0$ and $A \in \mathcal{P}_d$, let $L_{i,j,\nu}(x, A) := \partial h_{\nu,d}(x, A)/\partial C_{ij}$ from (19). Let $X := BL^*(\mathbb{R}^d, e)$ for the usual metric $e(s, t) := |s - t|$. Again, parameterize $A \in \mathcal{P}_d$ with inverse $C$ by $\{C_{ij}\}_{1 \le i \le j \le d} \in \mathbb{R}^{d(d+1)/2}$. Consider the open set $\Theta := \mathcal{P}_d \subset \mathbb{R}^{d(d+1)/2}$ and the function $F := F_\nu$ from $X \times \Theta$ into $\mathbb{R}^{d(d+1)/2}$ defined by

(20) $F(\phi, A) := \{\phi(L_{i,j,\nu}(\cdot, A))\}_{1 \le i \le j \le d}$.

Then $F$ is well-defined because $L_{i,j,\nu}(\cdot, A)$ are all bounded and Lipschitz functions of $x$ for each $A \in \Theta$; in fact, they are $C^1$ with bounded derivatives equal except possibly for signs to second partials of $L_{\nu,d}$ with respect to $C_{ij}$. The next fact, proved in [6, Section 5], uses some basic notions and facts from infinite-dimensional calculus, given in [?] and reviewed in the Appendix of [6].

Theorem 15. Let $X := BL^*(\mathbb{R}^d, e)$. In parts (a) through (c), let $\nu > 0$.

(a) The function $F = F_\nu$ is $C^\infty$ (for Fréchet differentiability) from $X \times \Theta$ into $\mathbb{R}^{d(d+1)/2}$.

(b) Let $Q \in \mathcal{U}_{d,\nu+d}$, and take the corresponding $\phi_Q \in X$. At $A_\nu(Q)$, the $d(d+1)/2 \times d(d+1)/2$ matrix $\partial F(\phi_Q, A)/\partial C := \{\partial F(\phi_Q, A)/\partial C_{kl}\}_{1 \le k \le l \le d}$ is invertible.

(c) The functional $Q \mapsto A_\nu(Q)$ is $C^\infty$ for the $BL^*$ norm on $\mathcal{U}_{d,\nu+d}$.

(d) For each $\nu > 1$, the functional $P \mapsto (\mu_\nu, \Sigma_\nu)(P)$ given by Theorems 8 and 12 is $C^\infty$ on $\mathcal{V}_{d,\nu+d}$ for the $BL^*$ norm.

To prove asymptotic normality of $\sqrt{n}(T(P_n) - T(P))$ for $T = (\mu_\nu, \Sigma_\nu)$, the dual-bounded-Lipschitz norm $\|\cdot\|^*_{BL}$ is too strong for some heavy-tailed distributions. Giné and Zinn [12] proved that for $d = 1$, $\{f : \|f\|_{BL} \le 1\}$ is a $P$-Donsker class if and only if $\sum_{j=1}^\infty \Pr(j - 1 < |X| \le j)^{1/2} < \infty$ for $X$ with distribution $P$. To define norms better suited to present purposes, for $\delta > 0$ and $r = 1, 2, \ldots$, let $\mathcal{F}_{\delta,r}$ be the set of all functions of $y$ appearing in (19) and their partial derivatives with respect to $C_{ij}$ through order $r$, for any $A \in \mathcal{W}_\delta$. Then each $\mathcal{F}_{\delta,r}$ is a uniformly bounded VC major class as in Theorem 3. Let $Y_{\delta,r}$ be the linear span of $\mathcal{F}_{\delta,r}$. Let $X_{\delta,r}$ be the set of all real-valued linear functionals $\phi$ on $Y_{\delta,r}$ for which $\|\phi\|_{\delta,r} := \sup\{|\phi(f)| : f \in \mathcal{F}_{\delta,r}\} < \infty$. For $A \in \mathcal{W}_{\delta,d}$ and $\phi \in X_{\delta,r}$, define $F(\phi, A)$ again by (20), which makes sense since each $L_{i,j,\nu}(\cdot, A) \in \mathcal{F}_{\delta,r}$ for any $r = 0, 1, 2, \ldots$ by definition.

The next two theorems are also proved in [6, Section 5]. Theorem 17 is a delta-method fact.

Theorem 16. Let $0 < \delta < 1$. For any positive integers $d$ and $r$, Theorem 15 holds for $X = X_{\delta,r+3}$ in place of $BL^*(\mathbb{R}^d, e)$, $\mathcal{W}_{\delta,d}$ in place of $\Theta$, and $C^r$ in place of $C^\infty$ wherever it appears (parts (a), (c), and (d)).

Theorem 17. (a) For any $d = 2, 3, \ldots$ and $\nu > 0$, let $Q \in \mathcal{U}_{d,\nu+d}$. Then the empirical measures $Q_n \in \mathcal{U}_{d,\nu+d}$ with probability $\to 1$ as $n \to \infty$ and $\sqrt{n}(A_\nu(Q_n) - A_\nu(Q))$ converges in distribution to a normal distribution with mean 0 on $\mathbb{R}^{d(d+1)/2}$ if $A \in \mathcal{P}_d$ is parameterized by $\{A_{ij}\}_{1 \le i \le j \le d}$, or a different normal distribution for the parameterization by $\{(A^{-1})_{ij}\}_{1 \le i \le j \le d}$ as above. The limit distributions can also be taken on $\mathbb{R}^{d^2}$, concentrated on symmetric matrices.

(b) Let $d = 1, 2, \ldots$ and $1 < \nu < \infty$. For any $P \in \mathcal{V}_{d,\nu+d}$, the empirical measures $P_n \in \mathcal{V}_{d,\nu+d}$ with probability $\to 1$ as $n \to \infty$ and the functionals $\mu_\nu$ and $\Sigma_\nu$ are such that as $n \to \infty$, $\sqrt{n}\left[(\mu_\nu, \Sigma_\nu)(P_n) - (\mu_\nu, \Sigma_\nu)(P)\right]$ converges in distribution to some normal distribution with mean 0 on $\mathbb{R}^d \times \mathbb{R}^{d^2}$, whose marginal on $\mathbb{R}^{d^2}$ is concentrated on $d \times d$ symmetric matrices.

Now, here is a statement on uniformity as $P$ and $Q$ vary, proved in [6, Section 5].

Proposition 18. For any $\delta > 0$ and $M < \infty$, the rate of convergence to normality in Theorem 17(a) is uniform over the set $\mathcal{Q} := \mathcal{Q}(\delta, M)$ of all $Q \in \mathcal{U}_{d,\nu+d}$ such that $A_\nu(Q) \in \mathcal{W}_\delta$ and

(21) $Q(\{z : |z| > M\}) \le (1 - \delta)/(\nu + d)$,

or in part (b), over all $P \in \mathcal{V}_{d,\nu+d}$ such that $\Sigma_\nu(P) \in \mathcal{W}_\delta$ and (21) holds for $P$ in place of $Q$.